Incorporating window-based passage-level evidence in document retrieval

Authors

  • Wensi Xi
  • Richard Xu-Rong
  • Christopher S. G. Khoo
  • Ee-Peng Lim
Abstract

This study investigated whether information retrieval can be improved if documents are divided into smaller subdocuments or passages, and the retrieval scores for these passages are incorporated in the final retrieval score for the whole document. The documents were segmented by sliding a window of a certain size across the document. Each time the window stopped, it extracted a certain number of contiguous words. A retrieval score was calculated for each of the passages extracted, and the highest score obtained by a passage of that size was taken as the document’s “window score” for that window size. A range of window sizes was tried. The experimental results indicated that using a fixed window size of 50 gave better results than other window sizes for the TREC test collection. This window size yielded a significant retrieval improvement of 24% compared to using the whole-document retrieval score. However, combining this window score and the whole-document retrieval score did not yield a retrieval improvement. Identifying the highest window score for each document (using window sizes varying from 50 to 400 words) and adopting it as the document retrieval score yielded a retrieval improvement of about 5% over taking the size-50 window score. Different window sizes were found to work best for different queries. If the best window size for each query could be predicted accurately, a maximum retrieval improvement of 42% could be obtained. However, no effective way has yet been found to predict which window size would give the best results for each query.
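The window-scoring procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how far the window moves at each stop or which retrieval model scores a passage, so the half-window step and the `overlap_score` function (the fraction of passage words that are query terms) are assumptions, and all function names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], float]


def passages(tokens: Sequence[str], size: int, step: int) -> Iterator[Sequence[str]]:
    """Yield contiguous windows of `size` words, advancing `step` words each time."""
    if len(tokens) <= size:
        # Document is no longer than the window: the whole document is the passage.
        yield tokens
        return
    last_start = len(tokens) - size
    for start in range(0, last_start + 1, step):
        yield tokens[start:start + size]
    # Add one final window so the tail of the document is always covered.
    if last_start % step != 0:
        yield tokens[last_start:]


def overlap_score(query: Sequence[str], passage: Sequence[str]) -> float:
    """Placeholder score (an assumption): fraction of passage words that are query terms."""
    if not passage:
        return 0.0
    terms = set(query)
    return sum(1 for tok in passage if tok in terms) / len(passage)


def window_score(query: Sequence[str], tokens: Sequence[str], size: int,
                 step: int = 0, score: Scorer = overlap_score) -> float:
    """Highest passage score for one window size: the document's "window score"."""
    if step <= 0:
        step = max(1, size // 2)  # assumed step; the abstract does not specify it
    return max(score(query, p) for p in passages(tokens, size, step))


def best_window_score(query: Sequence[str], tokens: Sequence[str],
                      sizes: Sequence[int] = (50, 100, 200, 400),
                      score: Scorer = overlap_score) -> float:
    """Highest window score over several window sizes, used as the document score."""
    return max(window_score(query, tokens, s, score=score) for s in sizes)


if __name__ == "__main__":
    doc: List[str] = ["x"] * 10 + ["a", "b"] + ["x"] * 8
    print(window_score(["a", "b"], doc, size=4))
    print(best_window_score(["a", "b"], doc, sizes=(4, 20)))
```

Because the placeholder score is a simple term density, it tends to favour small windows; a real system would plug its own retrieval model in through the `score` parameter, and the window sizes listed in `sizes` are only example values within the 50 to 400 word range the abstract mentions.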

Similar resources

Passage-level Evidence for Cross-Language Information Retrieval

Machine translation (MT) techniques can be used to generate a query in a target language from a query in a source language for cross-language information retrieval (CLIR). Recent MT systems have advanced enough to generate translations which are human-readable. However, translation error is still a serious impediment which hurts the effectiveness of a CLIR system. To compensate for defects ...

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

IIT TREC 2006: Genomics Track

For the TREC-2006 Genomics Track, we report on the effectiveness of composite information retrieval functions based on a dimensional data model for improving document, passage, and aspect search precision of genomics literature. We designed an approach, and developed a corresponding search engine, based on a novel dimensional data model capable of document, paragraph, sentence, and passage leve...

Comparing Document Segmentation Strategies for Passage Retrieval in Question Answering

Information retrieval (IR) techniques are used in question answering (QA) to retrieve passages from large document collections which are relevant to answering given natural language questions. In this paper we investigate the impact of document segmentation approaches on the retrieval performance of the IR component in our Dutch QA system. In particular we compare segmentations into discourse-b...

Enhancing Relevance Models with Adaptive Passage Retrieval

Passage retrieval and pseudo relevance feedback/query expansion have been reported in the literature as two effective means for improving document retrieval. Relevance models, while improving retrieval in most cases, hurt performance on some heterogeneous collections. Previous research has shown that combining passage-level evidence with pseudo relevance feedback brings added benefits. In this pap...

Journal:
  • J. Information Science

Volume 27

Publication date: 2001